Abstract. This paper presents our ongoing effort on developing a principled
methodology for automatic ontology mapping based on BayesOWL, a probabilistic
framework we developed for modeling uncertainty in the semantic web. In
this approach, the source and target ontologies are first translated into Bayesian
networks (BNs); the concept mapping between the two ontologies is then treated as
evidential reasoning between the two translated BNs. Probabilities needed for
constructing conditional probability tables (CPTs) during translation and for
measuring semantic similarity during mapping are learned using text classification
techniques, where each concept in an ontology is associated with a set of
semantically relevant text documents obtained by ontology-guided
web mining. The basic ideas of this approach are validated by positive results
from computer experiments on two small real-world ontologies.


1  Introduction 

Uncertainty concerns every aspect of semantic web ontologies. In many applications,
overlapping between concepts/classes cannot be represented logically by OWL constructs.
Even if it can, the degree of overlapping is not represented (e.g., how close
a class A is to its superclass B?). A description of an unknown concept or object
input to an OWL reasoner may be uncertain (e.g., x is an instance of class A and is
moderately likely to have property p related to class B). In previous work, we
developed BayesOWL, a framework based on Bayesian networks, to address representation
and reasoning with uncertainty within a single ontology ([5], [6]).

Uncertainty becomes more prevalent in concept mapping between two ontologies,
where it is often the case that a concept defined in one ontology can only find partial
matches to one or more concepts in another ontology. Semantic similarities between
concepts are difficult, if not impossible, to represent logically, but can easily be
represented probabilistically. This has motivated the recent development of ontology
mapping methods that take probabilistic approaches (GLUE [7], CAIMAN [11], OntoMapper
[19], and OMEN [13]). (See [14] for a survey of existing approaches to ontology
mapping, including those based on logical translation and on syntactical and linguistic
analysis.) However, these existing approaches fail to completely address uncertainty
in mapping. For example, GLUE captures the similarity between two concepts onto1:A
and onto2:B by the joint probability distribution P(A, B), obtained by text classification of
exemplars (semantically relevant text documents) for each concept. Then onto1:A is
mapped to onto2:C, whose similarity to onto1:A, measured by, say, the Jaccard coefficient
[21] (computed from the joint distribution), passes a threshold and is the highest
among all concepts in onto2. Here, onto1:A is taken as (semantically) equivalent to
onto2:C; the degree of similarity between them is not considered in subsequent reasoning
(e.g., subsumption within onto2). Also ignored are the other concepts that are
similar to onto1:A (albeit to a smaller degree).

The  work  reported  in  this  paper  extends  BayesOWL  in  a  number  of  significant 
ways  so  that  uncertainty  in  ontology  mapping  can  be  dealt  with  properly.  As  depicted 
in Figure 1 below, this new framework consists of three components: 1) a text
classification based learner to learn from web data the probabilistic ontological
information within individual ontologies and between concepts in two different ontologies; 2)
a BayesOWL module to translate given ontologies (together with the learned uncertain
information) into BNs; and 3) a concept mapping module which takes a set of
learned  raw  similarities  as  input  and  finds  mappings  between  concepts  from  two 
different  ontologies  based  on  evidential  reasoning  across  two  BNs. 


Before  describing  the  BN  Mapping  module  and  the  learner  in  detail  (Sections  3 
and  4),  we  first  provide  some  background  information  in  Section  2.  This  includes  a 
brief summary of BayesOWL, and introductions to Jeffrey's rule and the iterative proportional
fitting procedure (IPFP), two techniques used in this work. Methods and results
of  computer  experiments  with  two  small  ontologies  are  given  in  Section  5.  The  paper 
concludes  with  discussions  and  directions  of  future  research  in  Section  6. 


2  Background 

As  background,  we  briefly  introduce  Jeffrey’s  rule,  IPFP,  and  BayesOWL  here. 


2.1  Techniques  for  Updating  Probability  Distributions 

Two  techniques  for  updating  a  probability  distribution  by  another  distribution  used  in 
this  work  are  briefly  described  below. 

Jeffrey's rule, also known as the rule of probability kinematics or J-conditioning, was
proposed by Richard Jeffrey [9] to revise a probability measure (e.g., a joint distribution
$P(X)$) by another probability function (e.g., a prior $Q(X_i)$ in another distribution).
The rule can be written as follows in this context: if $P(X_i)$, our belief on $X_i$, is
changed to $Q(X_i)$, then the beliefs of the other variables $X_j \in X$, $j \neq i$, shall be changed to

$$Q(X_j) = \sum_{x_i} P(X_j \mid X_i = x_i)\, Q(X_i = x_i) \tag{2.1}$$

if $P(X_j \mid X_i)$ is invariant with respect to $Q(X_i)$.

Jeffrey's rule can be used as a mechanism to update a distribution by soft evidence,
represented as a distribution such as $Q(X_i)$. The rule can then be written as

$$P(X_i \mid se) = Q(X_i), \tag{2.2}$$

and

$$Q(X_{j \neq i}) = P(X_j \mid se) = \sum_{x_i} P(X_j \mid X_i = x_i)\, P(X_i = x_i \mid se) = \sum_{x_i} P(X_j \mid X_i = x_i)\, Q(X_i = x_i). \tag{2.3}$$

Pearl ([16], [17]) has shown that virtual evidence, a method widely adopted in
Bayesian network (BN) inference, can be viewed as formally equivalent to the likelihood
ratio version of Jeffrey's rule. This is done by adding a virtual node $ve_i$, which
has $X_i$ as its only parent in the BN, related by the likelihood ratio

$$L(X_i) = \frac{P(ve_i \mid x_i)}{P(ve_i \mid \bar{x}_i)} = \frac{Q(x_i)\, P(\bar{x}_i)}{Q(\bar{x}_i)\, P(x_i)} \tag{2.4}$$

when $X_i$ is binary. The soft evidence update (eqs. 2.2 and 2.3) can be realized by BN
belief update with $ve_i$ instantiated to true. It can be shown that $L(X_i)$ for multi-valued
variables can also be calculated from $P(X_i)$ and $Q(X_i)$ [17].

As will be seen shortly, we use Jeffrey's rule to propagate beliefs on variables
probabilistically between the two BNs that are translated from the two ontologies during mapping.
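As a minimal numerical sketch of the two views described above, the following Python snippet applies Jeffrey's rule (2.1) to a small two-variable joint distribution and then checks that a virtual evidence node with the likelihood ratio of (2.4) produces the same updated belief. The function names and the example numbers are illustrative assumptions, not part of BayesOWL.

```python
def jeffrey_update(joint, q_i):
    """Jeffrey's rule (2.1): Q(x_j) = sum over x_i of P(x_j | x_i) * Q(x_i).

    joint[i][j] holds P(X_i = i, X_j = j); q_i[i] is the new belief Q(X_i = i).
    """
    p_i = [sum(row) for row in joint]              # prior marginal P(X_i)
    q_j = [0.0] * len(joint[0])
    for i, row in enumerate(joint):
        for j, p_ij in enumerate(row):
            q_j[j] += (p_ij / p_i[i]) * q_i[i]     # P(x_j | x_i) * Q(x_i)
    return q_j


def virtual_evidence_update(joint, q_i):
    """Same update via a virtual evidence node on a binary X_i, using (2.4)."""
    p_i = [sum(row) for row in joint]
    # Index 0 is the "true" state, index 1 the "false" state.
    ratio = (q_i[0] * p_i[1]) / (q_i[1] * p_i[0])
    likelihood = [ratio, 1.0]                      # P(ve | x_i) up to a constant
    weighted = [[p * likelihood[i] for p in row] for i, row in enumerate(joint)]
    z = sum(sum(row) for row in weighted)          # normalizing constant
    return [sum(weighted[i][j] for i in range(len(joint))) / z
            for j in range(len(joint[0]))]


# Hypothetical joint P(X_i, X_j): rows are states of X_i, columns states of X_j.
joint = [[0.30, 0.10],
         [0.20, 0.40]]
q_i = [0.7, 0.3]                                   # soft evidence Q(X_i)

by_jeffrey = jeffrey_update(joint, q_i)
by_virtual = virtual_evidence_update(joint, q_i)
print(by_jeffrey, by_virtual)
```

Both calls give Q(x_j) = 0.625 for the true state, which illustrates the equivalence for the binary case only; multi-valued variables would need one likelihood value per state.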


IPFP (Iterative Proportional Fitting Procedure) is a computational procedure that
updates a given distribution $Q_0(x)$ to satisfy a set of probability constraints
$R = \{R_i(y^i)\}$, where each $R_i(y^i)$ is a distribution over $Y^i \subseteq X$ [10]. Roughly speaking,
IPFP iterates over the constraints in $\{R_i(y^i)\}$ in a cycle; at each iteration, the current
distribution is updated by one constraint according to

$$Q_k(x) = Q_{k-1}(x) \cdot \frac{R_i(y^i)}{Q_{k-1}(y^i)} \tag{2.5}$$


It has been proved based on I-divergence geometry ([4], [22]) that IPFP converges
to a unique distribution $Q^*(x)$, which 1) satisfies all $R_i(y^i)$ in $R$, i.e., $Q^*(y^i) = R_i(y^i)$
for all $R_i \in R$, and 2) has the smallest Kullback-Leibler distance (or I-divergence)
to $Q_0(x)$ among all distributions $Q(x)$ that satisfy all constraints in $R$, i.e.,

$$I(Q^* \,\|\, Q_0) = \sum_x Q^*(x) \log \frac{Q^*(x)}{Q_0(x)} \tag{2.6}$$

is minimized. $Q^*(x)$ is called the I-projection of $Q_0(x)$ on $R$. Bock [1] and Cramer [2]
extended IPFP to conditional IPFP (CIPFP) to allow constraints in the form of
conditional probability distributions and proved its convergence.
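The update (2.5) can be sketched in a few lines of Python. This is only an illustration on an explicit joint table with marginal constraints; the dictionary-based representation, the function names, and the example numbers are assumptions made here, and the sketch does not cover the conditional constraints handled by CIPFP.

```python
def marginal(dist, idx):
    """Marginalize a joint {assignment tuple: prob} onto the positions in idx."""
    out = {}
    for x, p in dist.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out


def ipfp(q0, constraints, sweeps=200, tol=1e-12):
    """Cycle through constraints (idx, R), applying update (2.5) until stable."""
    q = dict(q0)
    for _ in range(sweeps):
        change = 0.0
        for idx, r in constraints:
            m = marginal(q, idx)                   # Q_{k-1}(y^i)
            for x in q:
                y = tuple(x[i] for i in idx)
                new = q[x] * r[y] / m[y] if m[y] > 0 else 0.0
                change = max(change, abs(new - q[x]))
                q[x] = new
        if change < tol:
            break
    return q


# Hypothetical initial distribution over two binary variables (X0, X1).
q0 = {(True, True): 0.3, (True, False): 0.1,
      (False, True): 0.2, (False, False): 0.4}
constraints = [
    ((0,), {(True,): 0.7, (False,): 0.3}),         # R_1 on X0
    ((1,), {(True,): 0.4, (False,): 0.6}),         # R_2 on X1
]
q_star = ipfp(q0, constraints)
print(marginal(q_star, (0,)), marginal(q_star, (1,)))
```

Because every entry of the joint table is touched at each step, the cost grows exponentially with the number of variables, which is the expense that motivates the decomposed variant discussed later.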

If we consider $Q(y^i)$ as soft evidence on a collection of variables $Y^i$, then IPFP
can be considered as another mechanism for processing soft evidence [20]. The difference
between Jeffrey's rule and IPFP in this regard is that the former requires the
invariance of domain knowledge (i.e., $P(x_j \mid x_i)$ remains unchanged in $Q(x)$), while
the latter requires minimizing I-divergence, which in general destroys the invariance
in the updated $Q^*(x)$. How these two techniques are combined when used in
ontology-to-BN translation and in concept mapping is described in Subsection 2.2
and Section 3.


2.2  BayesOWL 

BayesOWL  ([5],  [6])  is  a  framework  which  augments  and  supplements  OWL  for 
representing  and  reasoning  with  uncertainty  based  on  Bayesian  networks.  This 
framework  provides  a  set  of  rules  and  procedures  for  direct  translation  of  an  OWL 
ontology  into  a  BN  structure  (a  directed  acyclic  graph  or  DAG)  and  a  method  based 
on  IPFP  that  utilizes  available  probability  constraints  about  classes  and  interclass 
relations  in  constructing  the  conditional  probability  tables  (CPTs)  of  the  BN.  The 
translated BN, which preserves the semantics of the original ontology and is consistent
with the probabilistic constraints, can support ontology reasoning, both within
and across ontologies, as Bayesian inferences.

Structural  translation  The  general  principle  underlying  the  structural  translation 
rules  is  that  all  classes  (specified  as  “subjects”  and  “objects”  in  RDF  triples  of  the 
OWL  file)  are  translated  into  nodes  in  BN,  and  an  arc  is  drawn  between  two  nodes  in 
BN  if  the  corresponding  two  classes  are  related  by  a  “predicate”  in  the  OWL  file, 
with  the  direction  from  the  superclass  to  the  subclass. 

The model-theoretic semantics of OWL treats the domain as a non-empty collection
of individuals. If class A represents a concept, the node it is translated to is
treated as a binary random variable with two states $a$ and $\bar{a}$, and we interpret $P(A = a)$
as the prior probability or one's belief that an arbitrary individual belongs to class A,
and $P(a \mid b)$ as the conditional probability that an individual of class B also belongs
to class A. Similarly, for $P(\bar{a})$, $P(\bar{a} \mid b)$, $P(a \mid \bar{b})$, and $P(\bar{a} \mid \bar{b})$, we interpret the
negation as “not belonging to”.


Control nodes are created during the translation to facilitate modeling relations
among class nodes that are specified by OWL logical operators, and there is a converging
connection from each of the concept nodes involved in this logical relation to
its specific control node. There are five types of control nodes in total, corresponding
to the five types of logical relations: “and” (owl:intersectionOf), “or” (owl:unionOf),
“not” (owl:complementOf), “disjoint” (owl:disjointWith), and “same as”
(owl:equivalentClass).


Constructing CPTs The nodes in the DAG obtained from the structural translation
step can be divided into two disjoint groups: $X_R$, regular nodes representing concepts
in the ontology, and $X_C$, control nodes for bridging logical relations. The CPT for a control
node in $X_C$ can be determined by the logical relation it represents, so that when its
state is “True”, the corresponding logical relation holds among its parent nodes.
When all the control nodes' states are set to “True” (denote this situation as CT), all
the logical relations defined in the original ontology hold in the translated BN.
The remaining issue is then to construct the CPTs for the nodes in $X_R$ so that $P(X_R \mid CT)$,
the joint distribution of all regular nodes in the subspace of CT, is consistent with all
the given probabilistic constraints about classes and relations between classes. These
constraints most likely include priors $P(C)$ for classes and conditionals $P(C \mid D)$ for relations
between classes C and D. Several suggestions have been made to encode probability
constraints in semantic web languages (e.g., [6] with OWL, and [8] with RDF).
These constraints can be obtained from the ontology designers or learned from data
(an approach that learns these constraints from the web is described in Section 4).
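To make the role of control nodes concrete, the sketch below builds deterministic CPTs in which the control node is “True” exactly when its logical relation holds among the parents. The parent orderings and truth conditions shown are an assumed reading of the five relations (for “and” and “or”, the first parent is taken to be the defined class); the precise definitions used by BayesOWL are given in [5] and [6], and all names here are hypothetical.

```python
from itertools import product

# Each entry: (number of parents, predicate that is True when the relation holds).
# For "intersection" and "union" the first parent c is the defined class and
# a, b are its operands; this ordering is an assumption of the sketch.
RELATIONS = {
    "intersection": (3, lambda c, a, b: c == (a and b)),
    "union":        (3, lambda c, a, b: c == (a or b)),
    "complement":   (2, lambda a, b: a != b),
    "disjoint":     (2, lambda a, b: not (a and b)),
    "equivalent":   (2, lambda a, b: a == b),
}


def control_cpt(relation):
    """Return {parent states: P(control = True | parents)} for one relation."""
    arity, holds = RELATIONS[relation]
    return {parents: (1.0 if holds(*parents) else 0.0)
            for parents in product([True, False], repeat=arity)}


for name in RELATIONS:
    print(name, control_cpt(name))
```

Conditioning on every control node being “True” then removes exactly those parent configurations that violate a relation, which is the subspace CT referred to above.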

In principle, IPFP can be applied to construct CPTs to satisfy all the given probabilistic
constraints. Two difficulties exist. First, as we mentioned earlier, direct application
of IPFP may destroy the existing interdependencies between variables (i.e.,
the given DAG becomes invalid). Second, IPFP is computationally very expensive,
since every entry in the joint distribution of the BN must be updated at each iteration.
To overcome these difficulties, we developed an algorithm named D-IPFP that decomposes
IPFP so that each iteration only updates the small portion of the BN that is
directly involved with the chosen constraint, and the update is done only to CPTs
while keeping the DAG of the network intact [18]. In particular, when each of the
given constraints involves only one variable $C_i$ and a set of zero or more of its parents
$L_i$, (2.5) of IPFP becomes [5]


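$$\begin{aligned}
Q_k(C_i \mid \pi_i) &= Q_{k-1}(C_i \mid \pi_i) \cdot \frac{R_i(C_i \mid L_i)}{Q_{k-1}(C_i \mid L_i)}, \\
Q_k(C_j \mid \pi_j) &= Q_{k-1}(C_j \mid \pi_j) \quad \forall j \neq i,
\end{aligned} \tag{2.7}$$

where $\pi_i$ denotes the set of parents of $C_i$.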


The BayesOWL framework can support common ontology reasoning tasks as probabilistic
inferences in the translated BN. For example, given a concept description e,
it can answer queries about concept satisfiability (whether $P(e \mid CT) = 0$), about concept
overlapping (how close e is to a concept C, as $P(e \mid C, CT)$), and about concept
subsumption (find the concept which is most similar to e) by defining some similarity
measure such as the Jaccard coefficient [21].
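For reference, the Jaccard coefficient of two concepts can be computed directly from their joint distribution as P(A ∩ B) / P(A ∪ B). The short Python sketch below shows this computation; the function name and the example joint are illustrative assumptions.

```python
def jaccard(joint):
    """Jaccard coefficient from a 2x2 joint distribution.

    joint[0][0] = P(a, b), joint[0][1] = P(a, not b),
    joint[1][0] = P(not a, b), joint[1][1] = P(not a, not b).
    """
    both = joint[0][0]                               # P(A and B)
    either = joint[0][0] + joint[0][1] + joint[1][0]  # P(A or B)
    return both / either if either > 0 else 0.0


# Hypothetical joint distribution of two concepts.
example = [[0.3, 0.0],
           [0.1, 0.6]]
print(jaccard(example))
```

The coefficient is 1 when the two concepts always co-occur and 0 when they never do, so it can serve as a ranking score for candidate subsumers.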


3  Concept  Mapping  Between  Ontologies  Using  BN  Mapping 


It is often the case that, when attempting to map a concept A defined in Ontology 1 to
Ontology 2, there is no concept in Ontology 2 that is semantically identical to A. Instead,
A is similar to several concepts in Ontology 2 with different degrees of similarity. A
solution to this so-called one-to-many problem, as suggested by [19] and [7], is to
map A to the target concept B which is most similar to A by some measure. This simple
approach does not work well because 1) the degree of similarity between A and
B is not reflected in B and thus will not be considered in reasoning after the mapping;
2) information may be lost because other similar concepts are ignored in the mapping;
3) it cannot handle the situation where A itself is uncertain; and 4) it does not
work well when more than one concept needs to be mapped. To see the last point,
consider a situation where a concept x, defined as the intersection of A and B in onto1, is to
be mapped to onto2. Suppose the most similar concepts to A in onto2 are C and D,
and those to B are E and D; it would then be difficult to determine which of the three
(C, D, and E) x should be mapped to.

These difficulties in ontology mapping can be dealt with properly in our framework.
We assume that pair-wise similarity measures are available between any concepts
in two ontologies onto1 and onto2 (or between variables in BN1 and BN2,
respectively). We take mapping as an update of the probability distributions of variables in
BN2 by the distributions of variables in BN1, in accordance with the similarity measures
between these variables. Further inferences (e.g., finding the most probable subsumer
in onto2 for a concept defined in onto1) can be drawn by Bayesian inference with the
updated distribution of BN2. We present our approach starting from the basics: 1) a
notion of probabilistic semantic linkage between a pair of concepts/variables; 2) the
“1 to n” mapping (one variable in BN1 mapped to multiple similar ones in BN2); and
3) the “m to n” mappings where multiple variables in BN1 need to be mapped.


3.1  Pair-wise  Probabilistic  Semantic  Linkage 

We assume the similarity information between variable A in BN1 and B in BN2 is
captured by the joint distribution P(A, B). This distribution is in a probability space,
denoted as $PS_{1,2}$, which is related to but different from the spaces for A and B, denoted
as $PS_1$ and $PS_2$, respectively. Moreover, since this measure is based on the semantic
similarity intrinsic to the meanings of these two variables, P(A, B) is assumed invariant
with respect to changes in $PS_1$ and $PS_2$. That is, beliefs on A and B
may change when evidence is presented, but P(A, B) in $PS_{1,2}$ does not.


Probabilistic semantic linkage between A and B, which serves as a basic mapping
mechanism between similar variables, is defined as

$$SL_{A,B} = \langle PS_1, PS_2, A, B, P(A, B) \rangle,$$

where $A \in PS_1$, $B \in PS_2$, and P(A, B) measures the semantic similarity between
A and B. The influence on B by A via the single linkage $SL_{A,B}$ then changes
P(B) to Q(B) by P(A). This update can be viewed as two applications of Jeffrey's
rule across these three spaces, first from $PS_1$ to $PS_{1,2}$ and then from $PS_{1,2}$ to $PS_2$, as depicted
in Figure 2 below. Since A in $PS_1$ is identical to A in $PS_{1,2}$, P(A) in $PS_1$ becomes soft
evidence Q(A) to $PS_{1,2}$ by (2.2), and the distribution of B in $PS_{1,2}$ is updated by (2.3) to

$$Q(B) = \sum_{a} P(B \mid A = a)\, Q(A = a). \tag{3.1}$$

Q(B) is then applied as soft evidence from $PS_{1,2}$ to node B in $PS_2$, updating the distribution
of other variables C in $PS_2$ by (2.3) as

$$Q(C) = \sum_{b} P(C \mid B = b)\, Q(B = b) = \sum_{b} P(C \mid B = b) \sum_{a} P(B = b \mid A = a)\, P(A = a). \tag{3.2}$$


Fig. 2. Mapping concept A to B via semantic linkage $SL_{A,B}$
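The two-step propagation through a single semantic linkage can be illustrated with a small Python sketch. It assumes binary variables, a linkage joint P(A, B), and a CPT P(C | B) taken from BN2; the function names and all numbers are hypothetical and chosen only to show the arithmetic of (3.1) and (3.2).

```python
def conditional(joint):
    """Row-normalize joint[x][y] = P(X = x, Y = y) into P(Y = y | X = x).

    Assumes every row has positive total probability.
    """
    return [[p / sum(row) for p in row] for row in joint]


def propagate(cond, belief):
    """One Jeffrey step: Q(Y = y) = sum over x of P(Y = y | X = x) * belief[x]."""
    return [sum(cond[x][y] * belief[x] for x in range(len(belief)))
            for y in range(len(cond[0]))]


def map_via_linkage(p_a, joint_ab, cpt_c_given_b):
    """Apply (3.1) to get Q(B), then (3.2) to get Q(C) for a neighbor C of B."""
    q_b = propagate(conditional(joint_ab), p_a)    # (3.1)
    q_c = propagate(cpt_c_given_b, q_b)            # (3.2)
    return q_b, q_c


# Index 0 is the "true" state, index 1 the "false" state.
joint_ab = [[0.3, 0.1],        # P(a, b), P(a, not b): the linkage similarity
            [0.0, 0.6]]        # P(not a, b), P(not a, not b)
cpt_c_given_b = [[1.0, 0.0],   # P(C | b)
                 [0.3, 0.7]]   # P(C | not b)
p_a = [0.8, 0.2]               # current belief on A in BN1

q_b, q_c = map_via_linkage(p_a, joint_ab, cpt_c_given_b)
print(q_b, q_c)
```

Note that the linkage joint itself is never modified: only the belief on A that flows through it changes, which matches the invariance assumption on P(A, B).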


3.2  Multiple  Semantic  Linkages 

Usually, A in onto1 may be semantically similar to more than one concept in onto2.
For example, if A is fairly similar to B in onto2, it would also be similar to all super-concepts
and also some sub-concepts of B, possibly with different similarity measures.
In other words, mapping A to BN2 amounts to mapping it through all semantic linkages
that initiate from A and end at each similar concept $B^j$ in BN2. Probabilistically, BN2
can be seen as receiving n pieces of soft evidence, one for the linkage from A to $B^j$ for each
concept $B^j$ in BN2. This requires that 1) all similarity measures $P(A, B^j)$ remain invariant,
and 2) conditional dependencies among variables in BN2 also remain invariant. This
“1 to n” mapping can be carried out by a process that combines both Jeffrey's rule
and IPFP. Like IPFP, this process iterates over these linkages in a cycle until convergence.

This process can be realized by generalizing Pearl's virtual evidence approach for
soft evidence update [15]. In our method, a virtual evidence node $ve^j$ is attached to
each node $B^j$. At iteration step k, if the linkage from A to $B^j$ is chosen, then we first
calculate the likelihood ratio $L_k(B^j)$ for the virtual evidence node $ve^j$ that will be used to
simulate the soft evidence $Q(B^j)$ by

$$L_k(B^j) = \frac{Q(b^j)\, Q_{k-1}(\bar{b}^j)}{Q(\bar{b}^j)\, Q_{k-1}(b^j)} \tag{3.3}$$

and then apply Jeffrey's rule of (3.1) and (3.2) with the modified likelihood to update
variable beliefs in BN2. Note that (3.3) is the same as (2.4) except that $Q_{k-1}(B^j)$, the
new distribution obtained at step k-1, is used rather than the initial $P(B^j)$. Also note
that this process does not explicitly modify the joint distribution of BN2 as the standard
IPFP would do; instead, it modifies the likelihood associated with each virtual
evidence node $ve^j$ while keeping the joint distributions $P(A, B^j)$ and the CPTs in BN2
unchanged. It can be shown that when the process converges, beliefs on variables in
BN2 are consistent with all similarity measures $P(A, B^j)$ and with P(A), the belief of A in
BN1.


Mapping Reduction Using all n linkages in the “1 to n” type of mapping, as described
above, is computationally very expensive because the IPFP process takes a number of
iterations to converge, and each iteration involves a belief update of BN2, which itself
is exponential in the size of BN2. The problem gets worse for the “m to n” type of mapping,
where what needs to be mapped is a composite concept that is defined as a conjunction
(intersection) of several variables or their negations in BN1.

Fortunately, satisfying a given probabilistic relation P(A, B) does not always require
the use of a linkage from A to B, or even knowing what the linkage looks like. Several
probabilistic relations may be satisfied by one linkage. Consider a simple example
in Figure 3 with variables A and B in BN1, C and D in BN2, and similarities (joint
probabilities) between every pair as below:


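$$P(C, A) = \begin{pmatrix} 0.3 & 0.0 \\ 0.1 & 0.6 \end{pmatrix}, \quad
P(D, A) = \begin{pmatrix} 0.33 & 0.18 \\ 0.07 & 0.42 \end{pmatrix},$$

$$P(C, B) = \begin{pmatrix} 0.3 & 0.0 \\ 0.16 & 0.54 \end{pmatrix}, \quad
P(D, B) = \begin{pmatrix} 0.348 & 0.162 \\ 0.112 & 0.378 \end{pmatrix}.$$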


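[Figure 3: belief and conditional probability tables (True/False, in percent) for nodes A and B
in BN1 and C and D in BN2. Priors shown: 20/80 and 40/60. CPT of the child of A: 100/0 given
A = True, 10/90 given A = False. CPT of the child of C: 100/0 given C = True, 30/70 given C = False.]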

Fig.  3.  Mapping  Reduction  Example 


However, we do not need to set up linkages for all these relations. As Figure 3 depicts,
when we have a linkage from A to C, all these relations are satisfied (the other
three linkages are thus redundant). This is because not only beliefs on C, but also
beliefs on D are properly updated by the mapping A to C.
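The redundancy claim can be checked numerically. The sketch below assumes the structures A → B in BN1 and C → D in BN2, takes the CPT values P(B | A) and P(D | C) as they appear in the Figure 3 tables, and uses a linkage joint P(C, A) chosen here so that the derived P(D, A) and P(D, B) reproduce the values listed above. It further assumes that B and D interact with the other network only through A and C. All variable names are illustrative.

```python
# Index 0 is the "true" state, index 1 the "false" state.
p_ca = [[0.30, 0.00],            # p_ca[c][a]: assumed linkage joint P(C, A)
        [0.10, 0.60]]
p_b_given_a = [[1.0, 0.0],       # p_b_given_a[a][b], as read from Figure 3
               [0.1, 0.9]]
p_d_given_c = [[1.0, 0.0],       # p_d_given_c[c][d], as read from Figure 3
               [0.3, 0.7]]

states = range(2)

# P(d, a) = sum over c of P(d | c) * P(c, a)
p_da = [[sum(p_d_given_c[c][d] * p_ca[c][a] for c in states) for a in states]
        for d in states]

# P(c, b) = sum over a of P(c, a) * P(b | a)
p_cb = [[sum(p_ca[c][a] * p_b_given_a[a][b] for a in states) for b in states]
        for c in states]

# P(d, b) = sum over c of P(d | c) * P(c, b)
p_db = [[sum(p_d_given_c[c][d] * p_cb[c][b] for c in states) for b in states]
        for d in states]

print("P(D, A) =", p_da)
print("P(C, B) =", p_cb)
print("P(D, B) =", p_db)
```

Under these assumptions the single linkage from A to C, together with the two CPTs, determines the other three joint distributions, which is why no separate linkages are needed for them.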

Several experiments with large BNs have shown that only a very small portion of
all $n_1 \cdot n_2$ linkages are needed to satisfy all probability constraints. This, we suspect,
is due to the fact that some of these constraints can be derived from others based
on the probabilistic interdependencies among variables in the two BNs. We are currently
working on developing a set of rules that examine the BN structures
and CPTs so that redundant linkages can be identified and removed.


4  Learning  Probabilities  from  Web  Data 

In this work, we use prior probability distributions P(C) to capture the uncertainty
about concepts (i.e., how likely it is that an arbitrary individual belongs to class C), conditional
distributions P(C | D) for relations between C and D in the same ontology (e.g.,
how likely it is that an arbitrary individual in class D is also in D's subclass C), and joint
probability distributions P(A, B) for semantic similarity between concepts A and B from
different ontologies. Often these kinds of probabilistic information are not available
and are difficult to obtain from domain experts. Our solution is to learn these
probabilities using text classification techniques ([3], [12]) by associating a concept
with a group of sample text documents called exemplars. The idea is inspired by
machine learning based semantic integration approaches such as [7], [11], and
[19], where the meaning of a concept is implicitly represented by a set of exemplars
that are relevant to it.

Learning the probabilities for semantic similarity between concepts in two ontologies
is straightforward, assuming we have sufficient exemplars of good quality associated
with each concept. First, we build a model (classifier) for each concept in
Ontology 1 according to the statistical information in that concept's exemplars, using
a text classifier such as Rainbow¹ or the Bayesian text classifier dbacl². Then concepts in
Ontology 2 are classified into classes of Ontology 1 by feeding their respective exemplars
into the models of Ontology 1 to obtain a set of probabilistic scores. These
scores show the inter-concept similarity in probabilistic form. Concepts in Ontology 1
can be classified in the same way into classes of Ontology 2. This cross-classification
process (Figure 4) helps find a set of raw mappings between Ontology 1
and Ontology 2. Similarly, we can obtain prior or conditional probabilities related to
concepts in a single ontology through self-classification with the models learned for
that ontology.




Fig.  4.  Cross-classification  using  Text  Classifiers  on  Web  Data 


The quality of these text classification based methods is highly dependent on the
quality of the text exemplars for each concept, which together should capture the
meaning of the concept well. Two criteria are crucial in assessing the quality of
exemplars: each exemplar (or at least most of them) should be relevant to the meaning
of the concept, and the exemplars together should cover all aspects of that
concept. For example, articles on computer games are very relevant to the concept of
“computer applications”, but they alone hardly cover all computer applications.


1  http://www-2.cs.cmu.edu/~mccallum/bow/rainbow 

2  http://www.lbreyer.com/ 


The need to find sufficiently many relevant exemplars for a large number of concepts
greatly reduces the attractiveness and applicability of these machine learning
based approaches. It would be a very time-consuming task for knowledge workers to
find high quality text exemplars manually, as is apparently the case for GLUE [7]. Our
approach is to use search engines such as Google³ to retrieve text exemplars for each
concept node automatically from the WWW, the richest information resource available
nowadays. The goal is to search for documents in which the concept is used in its
intended semantics. The rationale is that the meaning of a concept can be described or
understood by the way it is used.

To find out what documents are relevant to a term, one cannot simply use the
words in the name of the term as keywords to query the search engine. This is because a
word may have multiple meanings (word senses), and a query using only the name of
the term in question may return documents related to a meaning different from the
intended semantics of the term. For example, in an ontology for “food”, a concept
named “apple” is a subconcept of “fruit”. If one only uses “apple” as the keyword for
the query, documents showing how to make an apple pie and how to use an iPod may
both be returned. Clearly, the documents using “apple” for its meaning in the computer
field are irrelevant to “apple” as a fruit. Fortunately, since we are dealing with concepts
in well defined ontologies, the semantics of a term is to a great extent specified by the
other terms used in defining this concept in the ontology, including the names of its super-
and subconcept classes and the properties of this concept and its superclasses. This
semantic information can thus be used to guide the web search with increased relevancy.
There are a number of ways the semantic information can be used to help
search. The simplest one, and the one we have experimented with so far, is to form the search
query for one concept by combining all the terms on the path from the root to that concept
node in the taxonomy. In the “apple” example, the query would then become
“food fruit apple”, and documents about iPod and Apple computers would not be
returned.

In the experiments, for each concept A, we search the web to obtain two sets of exemplars:
$U^{A+}$, containing exemplars that support (or are positively related to) A; and $U^{A-}$,
containing exemplars that support the negation of (or are negatively related to) A. Exemplars
in $U^{A+}$ are obtained by searching the web for pages that contain A and all names
of A's ancestors in the taxonomy, while those in $U^{A-}$ are obtained by searching for pages that
contain all names of A's ancestors but not A.
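The query construction described above can be sketched as follows. The child-to-parent dictionary, the function names, and the use of a leading minus sign to exclude a term are assumptions of this sketch (the minus sign is a common search engine convention, not something specified in this paper).

```python
def path_from_root(parent_of, concept):
    """Return the concept names on the path from the taxonomy root to concept."""
    path = []
    node = concept
    while node is not None:
        path.append(node)
        node = parent_of[node]
    path.reverse()
    return path


def quote(term):
    """Wrap multi-word names in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def positive_query(parent_of, concept):
    """Query for pages containing the concept and all of its ancestors."""
    return " ".join(quote(t) for t in path_from_root(parent_of, concept))


def negative_query(parent_of, concept):
    """Query for pages containing all ancestors but not the concept itself."""
    ancestors = path_from_root(parent_of, concept)[:-1]
    return " ".join([quote(t) for t in ancestors] + ["-" + quote(concept)])


# Hypothetical taxonomy: each concept maps to its parent, the root to None.
parent_of = {"food": None, "fruit": "food", "apple": "fruit"}
print(positive_query(parent_of, "apple"))   # food fruit apple
print(negative_query(parent_of, "apple"))   # food fruit -apple
```

The positive query reproduces the “food fruit apple” example from the text; the negative query is one possible way to request the ancestors without the concept name.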

With all these documents, we can obtain the joint probabilities of A and B by text classification,
similar to what is done in GLUE [7]: we apply the classifiers of concepts A
and B to all text documents in U and classify them into four categories: $U^{A+B+}$, $U^{A+B-}$,
$U^{A-B+}$, and $U^{A-B-}$. Then the joint probabilities can be obtained by counting the items in
each category, e.g., $P(A, B) = |U^{A+B+}| / |U|$. If we only search for positive exemplars
$U^{A+}$ and $U^{B+}$, then only the conditional probability $P(B \mid A)$ can be obtained (by applying
B's classifier to A's supportive exemplars to obtain $U^{A+B+}$ and computing $P(B \mid A) =
|U^{A+B+}| / |U^{A+}|$). The first approach is the one that works for our purpose.
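The counting step can be sketched in Python as below. The sketch takes the classifier decisions as given (one pair of booleans per document) rather than calling any particular text classifier, and the function names and sample labels are hypothetical.

```python
from collections import Counter

CATEGORIES = [(True, True), (True, False), (False, True), (False, False)]


def joint_from_labels(labels):
    """Estimate P(A, B) from (in_a, in_b) decisions, one pair per document in U."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cat: counts[cat] / total for cat in CATEGORIES}


def conditional_b_given_a(b_decisions_on_a_pos):
    """Estimate P(B | A) from B's classifier decisions on A's positive exemplars."""
    decisions = list(b_decisions_on_a_pos)
    return sum(decisions) / len(decisions)


# Hypothetical classifier output for ten documents.
labels = ([(True, True)] * 3 + [(True, False)] * 1 +
          [(False, True)] * 2 + [(False, False)] * 4)
joint = joint_from_labels(labels)
print(joint)
print(conditional_b_given_a([True, True, True, False]))
```

With both positive and negative exemplars, all four cells of the joint table are estimated, which is what the mapping step needs; the conditional estimate alone does not determine the joint without a prior on A.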


3  http://www.google.com 


5  Experiments 


We  have  performed  computer  experiments  on  two  small-scale  real-world  ontologies. 
Our  goals  are  to  find  how  good  the  learning  can  be  with  the  exemplars  mined  from 
the  web,  and  how  the  uncertainty  inference  across  multiple  Bayesian  networks  could 
help  ontology  mapping. 

Translating Taxonomies to BNs We took the Artificial Intelligence sub-domain
from the ACM Topic Taxonomy⁴ and the DMOZ⁵ (Open Directory) hierarchies and pruned
some concepts to form two ontologies, both of which have a single root node, Artificial
Intelligence. All other concepts in the hierarchies are subcategories of AI. These
two hierarchies differ in both terminology and modeling methods. DMOZ categorizes
concepts by the popularity of web pages to facilitate people's easy access to these
pages, while the ACM topic hierarchy categorizes concepts from super to sub to structure
a classification primarily for academics.


Table 1. Statistics of the experiments

Taxonomy   # Nodes   Depth   Total exemplar size   Avg. exemplar size   # Exemplars   Avg. # exemplars/node
ACM AI     15        3       19.7 MB               698 KB               24533         1636
DMOZ AI    25        3       29.2 MB               612 KB               35148         1406

For every concept except the root, we obtained exemplars by querying Google as
described in the previous section. The statistics of these web pages are listed in
Table 1. We used the Bayesian text classifier dbacl to create a model for each non-root
concept X and obtained the pair-wise conditional probability P(X | Parent(X)). The
root nodes were assigned a prior probability of (0.5, 0.5).

Then, using BayesOWL's translation rules, the two ontologies were translated into
two BNs as shown in Figure 5.


4  http://www.acm.org/class/1998/ 

5  http://dmoz.org/ 


Fig.  5.  Bayesian  network  for  ACM  topics’  AI  sub-domain  and  DMOZ’s  AI  sub-domain 


Learning uncertainty mappings. Raw mappings P(A, B) were computed for each
pair of concepts of the two BNs. The similarity between A and B was measured by
their Jaccard coefficient, computed from the joint probability. Table 2 lists the five
most similar and the five most different concept pairs in the learning result. The top
three most similar pairs are actually identical concepts. However, besides these
three, another pair of identical concepts is not measured as highly related:
/Learning/Connectionism & Neural Net in the ACM topic hierarchy and /Machine
Learning/Neural Network in DMOZ. Their similarity is only 0.61. We speculate this is
because the term "connectionism" is not as popular now as it was when the ACM topic
hierarchy was constructed, and thus is not used along with "Neural Network" in most
web pages.
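
For concreteness, the similarity computation can be sketched as below. We assume the
usual definition of the Jaccard coefficient on a joint distribution, J(A, B) =
P(A, B) / (P(A, B) + P(A, not B) + P(not A, B)); the function name jaccard and the
sample probabilities are illustrative and are not taken from the experiments.

```python
def jaccard(p_both, p_a_only, p_b_only):
    """Jaccard coefficient from three cells of a 2x2 joint distribution.

    p_both   = P(A, B)
    p_a_only = P(A, not B)
    p_b_only = P(not A, B)
    The fourth cell P(not A, not B) does not enter the measure.
    """
    union = p_both + p_a_only + p_b_only
    if union == 0:
        return 0.0  # neither concept has any support
    return p_both / union


# Made-up joint distribution: P(A,B)=0.375, P(A,~B)=0.125, P(~A,B)=0.125.
similarity = jaccard(0.375, 0.125, 0.125)
```

Under this definition the coefficient is 1 when the two concepts cover exactly the
same documents and 0 when they share none.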


Table 2. Five most similar concepts and five most different concepts in the learning result.
The root concept's name is omitted.

ACM topic                                      DMOZ                                 Similarity
/Knowledge Representation & Formalism Method   /Knowledge Representation            0.96
/Natural Language Processing                   /Natural Language                    0.90
/Learning                                      /Machine Learning                    0.88
/Learning                                      /Knowledge Representation            0.81
/Applications & Expert System                  /Knowledge Representation            0.79
/Fuzzy                                         /Learning/Analog                     0.03
/Learning/Induction                            /Learning/Game                       0.02
/Deduction & Theorem Proving                   /Programming Language/Declarative    0.02
/Learning/Induction                            /Application                         0.01
/Learning/Analogy                              Agent                                0.01


Inference with BN Mappings. Treating ontology mapping as Bayesian network
mapping as described here allows us to conduct probabilistic reasoning far beyond
finding the best concept match. We are actively investigating this issue and
developing related algorithms. To illustrate our point, consider the example of finding
a description of DMOZ's /Knowledge Representation/Semantic Web (dmoz.sw) in the
ACM topic hierarchy. Since no ACM concept is identical to dmoz.sw, it must be
described by a composite expression involving multiple ACM concepts. The two most
semantically similar concepts to dmoz.sw in ACM are /Knowledge Representation
and Formalism Method/Relation System (acm.rs) and /Knowledge Representation
and Formalism Method/Semantic Network (acm.sn), with the joint distributions

\[
P(\mathit{dmoz.sw}, \mathit{acm.rs}) =
\begin{pmatrix} 0.60 & 0.12 \\ 0.21 & 0.07 \end{pmatrix}
\quad \text{and} \quad
P(\mathit{dmoz.sw}, \mathit{acm.sn}) =
\begin{pmatrix} 0.58 & 0.13 \\ 0.25 & 0.04 \end{pmatrix},
\]

and respective Jaccard coefficients J(dmoz.sw, acm.rs) = 0.64 and J(dmoz.sw, acm.sn)
= 0.61.

From the two joint probabilities, we can see that dmoz.sw is not a subconcept of
either acm.rs or acm.sn, but has a sizable overlap with each of them. From the
following joint distribution

\[
P(\mathit{acm.rs}, \mathit{acm.sn}) =
\begin{pmatrix} 0.2612 & 0.0498 \\ 0.0323 & 0.6557 \end{pmatrix},
\]

we can see that acm.rs and acm.sn also overlap with each other. Figure 6 illustrates
the overlap of these three concepts.




Fig.  6.  The  Venn  diagram  for  dmoz.sw,  acm.rs,  and  acm.sn 

This leads to a conjecture that dmoz.sw may be described in terms of acm.rs and
acm.sn. To validate this conjecture, we need the conditional probability
P(acm.rs = true, acm.sn = true | dmoz.sw = true). This can be obtained as follows.

1. Using the learned probabilities P(dmoz.sw, acm.rs) and P(dmoz.sw, acm.sn), create
two semantic linkages, from dmoz.sw to acm.rs and to acm.sn, respectively.

2. Instantiate dmoz.sw as true, and compute the likelihoods for the two virtual
evidence nodes associated with acm.rs and acm.sn.

3. Compute P(acm.rs = true, acm.sn = true | dmoz.sw = true) by any Bayesian network
inference algorithm with the two virtual evidence nodes set to true.
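
The flavor of this computation can be conveyed by a much simplified sketch. Instead
of running inference over the full ACM network with virtual evidence nodes, the code
below takes only the 2x2 distribution P(acm.rs, acm.sn) as the prior and uses
iterative proportional fitting (IPFP) to adjust it until its marginals equal
P(acm.rs = true | dmoz.sw = true) and P(acm.sn = true | dmoz.sw = true). Several things
here are assumptions: the function name ipfp_2x2, the reading of each matrix with
its first row and first column as the "true" states, and the reading of the rows of
P(dmoz.sw, acm.rs) and P(dmoz.sw, acm.sn) as indexing dmoz.sw. Because the rest of the
network is ignored, the result should not be expected to match the value obtained in
our experiment.

```python
def ipfp_2x2(joint, target_row_true, target_col_true, iterations=200):
    """Rescale a 2x2 joint so its marginals match the given targets.

    joint[i][j]: i indexes the first variable, j the second,
    with index 0 = true and index 1 = false (assumed convention).
    """
    p = [row[:] for row in joint]
    row_targets = [target_row_true, 1.0 - target_row_true]
    col_targets = [target_col_true, 1.0 - target_col_true]
    for _ in range(iterations):
        # Fit the marginal of the first variable (rows).
        for i in range(2):
            row_sum = p[i][0] + p[i][1]
            p[i] = [v * row_targets[i] / row_sum for v in p[i]]
        # Fit the marginal of the second variable (columns).
        for j in range(2):
            col_sum = p[0][j] + p[1][j]
            for i in range(2):
                p[i][j] *= col_targets[j] / col_sum
    return p


# Prior P(acm.rs, acm.sn) as read from the text.
prior = [[0.2612, 0.0498], [0.0323, 0.6557]]
# Soft evidence induced by dmoz.sw = true (first rows of the two
# learned joint distributions, normalized; orientation is assumed).
p_rs = 0.60 / (0.60 + 0.12)
p_sn = 0.58 / (0.58 + 0.13)

fitted = ipfp_2x2(prior, p_rs, p_sn)
p_both_true = fitted[0][0]  # estimate of P(rs = true, sn = true | sw = true)
```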

In our experiment, this probability was computed to be 0.851. From this we could
conclude that the intersection of acm.rs and acm.sn is a highly probable subsumer of
dmoz.sw. A more detailed analysis may require the joint distribution of the three
concept nodes (in two ontologies/BNs) or a distribution involving additional relevant
ACM concepts (with similarity measures lower than those of acm.rs and acm.sn).
These distributions can be computed in a similar fashion.


6  Discussion  and  Future  Work 

This  paper  describes  our  ongoing  research  on  developing  a  probabilistic  framework 
for  automatic  ontology  mapping.  In  this  framework,  ontologies  (or  parts  of  them)  are 
first  translated  into  Bayesian  networks,  and  then  the  concept  mapping  is  realized  as 
evidential  reasoning  between  the  two  BNs  by  Jeffrey’s  rule.  The  probabilities  needed 
in both translation and mapping can be obtained by using text classification programs,
supported by associating individual concepts with relevant text exemplars retrieved
from the web.

We are currently working on each of these components. In searching for
relevant exemplars, we are attempting to develop a measure of relevancy so that less
relevant documents can be removed. We are also investigating how semantic
information can be utilized to post-process text documents mined from the web so that
less relevant ones can be identified and excluded. We are expanding the ontology to BN
translation from taxonomies to include properties, and developing algorithms to
support common ontology-related reasoning tasks. As for a general BN mapping
framework, our current focus is on linkage reduction. We are also working on the
semantics of BN mapping and examining its scalability and applicability. Future work
also includes developing methods to properly deal with inconsistent probability
constraints in the IPFP process.